20.1 Hash Tables BSTs provide add, contains, and remove operations in O(log n) time. Can you create a structure that improves on O(log n) time? Yes Hashing can implement a Set in constant time per operation Suppose you need to create a set of employee ID numbers. Each ID number is between 1 and 1000. How can you store the numbers so you can add/contains/remove in O(1) time? make an array of size 1000 index the array with the item value add(i) array[i]++ contains(i) return array[i] > 0 remove(i) array[i]-- Could you use this approach to store items (like Strings) that are not numbers? yes, convert the strings into numbers characters are represented by codes (commonly from 0 to 127) treat a string as a base-128 number Could you use this approach to store larger numbers? Suppose you are storing only 1000 different numbers, but the numbers range in value from 1 to 1000000. Would you make an array of size 1000000? make an array closer to size 1000 and use a hash function to map large numbers into smaller numbers What's a Hash Table? an array used to store items in a collection items are stored in the array at the index given by a hash function What's a Hash Function? a function that maps an item into a number the number is in the range of a valid array index the function usually contains a mod by table size (x % tableSize) What's a Collision? when two or more items hash to the same number resolve with chaining, linear probing, quadratic probing 20.2 Hash Function What are the characteristics of a good Hash Function? gives a number between 0 and tableSize-1 easy/fast to compute distributes items evenly throughout the hash table Would it make sense to use random numbers in a hash function to help ensure an even distribution of items? What's a typical hash function for integers? value % tableSize A hash function for Strings must convert a String to a number. What's wrong with this code for converting a String to a number? public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal += key.charAt(i); return hashVal % size; } same characters in different orders (tab, bat) give same result gives only small numbers (10 chars gives at most 1280) How is this hash function better than the previous one? public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal = (hashVal * 128 + key.charAt(i)) % size; return hashVal; } characters are weighted by their positions so different orders give different numbers and larger numbers can be produced How is this hash function better than the previous one? What problem does this code have with long strings? public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal = hashVal * 128 + key.charAt(i); hashVal %= size; if (hashVal < 0) hashVal += size; return hashVal; } the expensive mod operation is not repeated on every loop with long strings the early characters are lost from the answer How is this hash function better than the previous one? public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal = hashVal * 37 + key.charAt(i); hashVal %= size; if (hashVal < 0) hashVal += size; return hashVal; } the early characters are not lost as quickly What's a typical hash function for Java Objects? the hashCode method maps the object to an integer item.hashCode() % tableSize How do you implement hashCode for classes you write? combine the hashcodes for the parts of the object that makeup the key public class Vehicle { int state; // key String id; // key String make; String model; int year; public boolean equals( Object obj ) { if ( obj instanceof Vehicle ) { Vehicle other = (Vehicle)obj; return state == other.state && id.equals( other.id ); } else { return false; } } public int hashCode() { return state + id.hashCode(); } } 20.5 Chaining How do you resolve collisions using Chaining? the hash table is an array of linked-lists the hash function tells to which list an item is added and in which list an item can be found Show the result of inserting 2, 7, 3, 8 into a hash table of size 5. What's the Big-Oh bound of add/contains/remove when chaining is used? worst-case is O(n) (all items hash to the same bucket) average-case is O(n/m) (items distributed evenly over the buckets) What's the Load Factor of a hash table? lambda the fraction of the table that is full n/m is the load factor What's a good Load Factor to use with chaining? 1.0 lower doesn't significantly improve performance higher can save space Which is better Hashing or AVL trees? trees support order hash is O(1), trees are O(logN) hash is simpler